Machine Learning Analysis Report

Generated on August 03, 2025 at 09:00 PM

Machine Learning Analysis Pipeline

EDR: Dataset Loading & Preprocessing

EDR – Train/Test Overview
• Train shape: (88089, 20) | Test shape: (7533, 20)
• Total train samples: 88,089 | Total test samples: 7,533
• Number of features: 18
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 87,232
• 1: 857
• Class balance (minority/majority): 0.9824%
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
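The preparation steps above (infinite values handled, train-median imputation, StandardScaler fit on train only) can be sketched with pandas and scikit-learn; the column names below are illustrative, not the dataset's actual features:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frames standing in for the real train/test splits.
train = pd.DataFrame({"f1": [1.0, np.inf, 3.0, 4.0], "f2": [0.1, 0.2, np.nan, 0.4]})
test = pd.DataFrame({"f1": [2.0, -np.inf], "f2": [np.nan, 0.3]})

# 1) Replace infinite values with NaN so they can be imputed.
train = train.replace([np.inf, -np.inf], np.nan)
test = test.replace([np.inf, -np.inf], np.nan)

# 2) Fill missing values with the *train* medians (no test-set leakage).
medians = train.median()
train = train.fillna(medians)
test = test.fillna(medians)

# 3) StandardScaler: fit on train only, then transform both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(train)
X_test = scaler.transform(test)
```

Fitting the imputation medians and the scaler on the training split only is what keeps the test set untouched during preprocessing.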
⚠️ Extreme Class Imbalance Detected
• Minority class represents only 0.9824% of the data
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9902
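A minimal sketch of the most-frequent baseline and of cost-sensitive learning (one of the remedies suggested above), using synthetic data rather than the report's actual splits:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in: roughly 1% positives, mirroring the report's imbalance.
X = rng.normal(size=(2000, 2))
y = (rng.random(2000) < 0.01).astype(int)
X[y == 1] += 2.0  # shift positives so they are learnable

# Most-frequent baseline: the 0.9902 figure above is just the majority share.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(f"baseline accuracy: {baseline.score(X, y):.4f}")

# Cost-sensitive learning: class_weight='balanced' reweights errors by
# inverse class frequency instead of resampling (SMOTE would instead
# oversample the minority class before fitting).
clf = LogisticRegression(class_weight="balanced").fit(X, y)
recall = (clf.predict(X[y == 1]) == 1).mean()
```

The baseline's high accuracy despite predicting no positives at all is exactly why the report steers toward PR-AUC and F1.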

EDR: Model Performance Comparison

EDR – Model Performance Metrics

Model                 | Accuracy | Balanced Acc | Precision | Recall | F1     | ROC-AUC | PR-AUC
Logistic Regression   | 0.9627   | 0.5597       | 0.0480    | 0.1486 | 0.0726 | 0.6822  | 0.0512
Random Forest (SMOTE) | 0.9896   | 0.5466       | 0.3889    | 0.0946 | 0.1522 | 0.7053  | 0.1069
LightGBM              | 0.9891   | 0.5396       | 0.3000    | 0.0811 | 0.1277 | 0.8057  | 0.0800
Balanced RF           | 0.9046   | 0.6641       | 0.0438    | 0.4189 | 0.0794 | 0.8454  | 0.0657
SGD SVM               | 0.9707   | 0.5637       | 0.0651    | 0.1486 | 0.0905 | n/a     | n/a
IsolationForest       | 0.9773   | 0.5470       | 0.0708    | 0.1081 | 0.0856 | n/a     | n/a

Confusion Matrix Analysis

Model                 | TN   | FP  | FN | TP | FP Rate | Miss Rate
Logistic Regression   | 7241 | 218 | 63 | 11 | 2.92%   | 85.14%
Random Forest (SMOTE) | 7448 | 11  | 67 | 7  | 0.15%   | 90.54%
LightGBM              | 7445 | 14  | 68 | 6  | 0.19%   | 91.89%
Balanced RF           | 6783 | 676 | 43 | 31 | 9.06%   | 58.11%
SGD SVM               | 7301 | 158 | 63 | 11 | 2.12%   | 85.14%
IsolationForest       | 7354 | 105 | 66 | 8  | 1.41%   | 89.19%
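The FP Rate and Miss Rate columns follow directly from the confusion counts; using the Logistic Regression row above:

```python
# Logistic Regression (EDR) confusion-matrix counts from the table above.
tn, fp, fn, tp = 7241, 218, 63, 11

fp_rate = fp / (fp + tn)     # false alarms among actual negatives
miss_rate = fn / (fn + tp)   # missed positives, i.e. 1 - recall
recall = tp / (tp + fn)

print(f"FP rate:   {fp_rate:.2%}")    # 2.92%
print(f"Miss rate: {miss_rate:.2%}")  # 85.14%
print(f"Recall:    {recall:.2%}")     # 14.86%
```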

Best Models by Metric

• Accuracy: Random Forest (SMOTE) (0.9896)
• Balanced Acc: Balanced RF (0.6641)
• Precision: Random Forest (SMOTE) (0.3889)
• Recall: Balanced RF (0.4189)
• F1: Random Forest (SMOTE) (0.1522)
• ROC-AUC: Balanced RF (0.8454)
• PR-AUC: Random Forest (SMOTE) (0.1069)
• Lowest False Positive Rate: Random Forest (SMOTE) (0.15%)
• Lowest Miss Rate: Balanced RF (58.11%)
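The per-metric winners can be recomputed from the comparison table, e.g. with pandas (values copied from the EDR table above, trimmed to three metrics and four models for brevity):

```python
import pandas as pd

# Subset of the EDR model-performance table.
df = pd.DataFrame(
    {
        "Model": ["Logistic Regression", "Random Forest (SMOTE)",
                  "LightGBM", "Balanced RF"],
        "F1": [0.0726, 0.1522, 0.1277, 0.0794],
        "ROC-AUC": [0.6822, 0.7053, 0.8057, 0.8454],
        "PR-AUC": [0.0512, 0.1069, 0.0800, 0.0657],
    }
).set_index("Model")

best = df.idxmax()  # best (highest-scoring) model per metric column
print(best.to_dict())
```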

EDR – Metrics by Model

EDR – ROC Curves

EDR – Precision–Recall Curves

EDR – Predicted Probability Distributions

EDR – Threshold Sweep
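A threshold sweep varies the probability cutoff instead of using the default 0.5, trading precision against recall; a minimal sketch on synthetic scores (not the report's data):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Synthetic labels (~5% positives) and stand-in predicted probabilities.
y_true = (rng.random(1000) < 0.05).astype(int)
y_prob = np.clip(rng.random(1000) * 0.8 + 0.4 * y_true, 0.0, 1.0)

results = []
for t in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= t).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    results.append((t, p, r))
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold can only remove predicted positives, so recall is non-increasing as the cutoff grows; precision typically moves the other way.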

EDR: Logistic Regression – Detailed Analysis

EDR – Logistic Regression: Confusion Matrix

EDR – Logistic Regression: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9914    | 0.9708 | 0.9810 | 7459
1        | 0.0480    | 0.1486 | 0.0726 | 74
accuracy |           |        | 0.9627 | 7533
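Per-class reports in this format are typically produced with scikit-learn's classification_report; a toy example (labels are illustrative):

```python
from sklearn.metrics import classification_report

# Tiny imbalanced example: 6 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

# output_dict=True returns the same numbers as the printed table.
report = classification_report(y_true, y_pred, output_dict=True)
print(f"class 1 precision={report['1']['precision']:.4f} "
      f"recall={report['1']['recall']:.4f}")
```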

EDR – Logistic Regression: Feature Importance

EDR: Random Forest (SMOTE) – Detailed Analysis

EDR – Random Forest (SMOTE): Confusion Matrix

EDR – Random Forest (SMOTE): Classification Report

Class    | precision | recall | f1     | support
0        | 0.9911    | 0.9985 | 0.9948 | 7459
1        | 0.3889    | 0.0946 | 0.1522 | 74
accuracy |           |        | 0.9896 | 7533

EDR – Random Forest (SMOTE): Feature Importance

EDR: LightGBM – Detailed Analysis

EDR – LightGBM: Confusion Matrix

EDR – LightGBM: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9909    | 0.9981 | 0.9945 | 7459
1        | 0.3000    | 0.0811 | 0.1277 | 74
accuracy |           |        | 0.9891 | 7533

EDR – LightGBM: Feature Importance

EDR – LightGBM: Feature Importance

EDR – LightGBM: Feature Importance

EDR: Balanced RF – Detailed Analysis

EDR – Balanced RF: Confusion Matrix

EDR – Balanced RF: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9937    | 0.9094 | 0.9497 | 7459
1        | 0.0438    | 0.4189 | 0.0794 | 74
accuracy |           |        | 0.9046 | 7533

EDR – Balanced RF: Feature Importance

EDR: SGD SVM – Detailed Analysis

EDR – SGD SVM: Confusion Matrix

EDR – SGD SVM: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9914    | 0.9788 | 0.9851 | 7459
1        | 0.0651    | 0.1486 | 0.0905 | 74
accuracy |           |        | 0.9707 | 7533

EDR – SGD SVM: Feature Importance

EDR: IsolationForest – Detailed Analysis

EDR – IsolationForest: Confusion Matrix

EDR – IsolationForest: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9911    | 0.9859 | 0.9885 | 7459
1        | 0.0708    | 0.1081 | 0.0856 | 74
accuracy |           |        | 0.9773 | 7533

EDR – IsolationForest: Feature Importance

Feature importance not available for this model type.

XDR: Dataset Loading & Preprocessing

XDR – Train/Test Overview
• Train shape: (88089, 34) | Test shape: (7533, 34)
• Total train samples: 88,089 | Total test samples: 7,533
• Number of features: 32
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 87,232
• 1: 857
• Class balance (minority/majority): 0.9824%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
⚠️ Extreme Class Imbalance Detected
• Minority class represents only 0.9824% of the data
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9902

XDR: Model Performance Comparison

XDR – Model Performance Metrics

Model                 | Accuracy | Balanced Acc | Precision | Recall | F1     | ROC-AUC | PR-AUC
Logistic Regression   | 0.8250   | 0.6374       | 0.0252    | 0.4459 | 0.0477 | 0.6560  | 0.0459
Random Forest (SMOTE) | 0.9898   | 0.5399       | 0.4000    | 0.0811 | 0.1348 | 0.6897  | 0.1147
LightGBM              | 0.9899   | 0.5132       | 0.3333    | 0.0270 | 0.0500 | 0.8440  | 0.0773
Balanced RF           | 0.9190   | 0.6781       | 0.0533    | 0.4324 | 0.0950 | 0.8459  | 0.0597
SGD SVM               | 0.8852   | 0.5942       | 0.0263    | 0.2973 | 0.0484 | n/a     | n/a
IsolationForest       | 0.9870   | 0.5185       | 0.1000    | 0.0405 | 0.0577 | n/a     | n/a
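ROC-AUC and PR-AUC require continuous scores rather than hard labels, which presumably explains the n/a entries for models where the pipeline had no probability output; a sketch on synthetic imbalanced data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic, heavily imbalanced labels (5% positives) with stand-in scores.
rng = np.random.default_rng(0)
y_true = np.array([0] * 950 + [1] * 50)
y_score = rng.random(1000)
y_score[y_true == 1] += 0.3  # nudge positive scores upward

roc = roc_auc_score(y_true, y_score)
pr = average_precision_score(y_true, y_score)  # PR-AUC
print(f"ROC-AUC={roc:.3f}  PR-AUC={pr:.3f}")
```

Under heavy imbalance a random scorer still gets ROC-AUC near 0.5 but PR-AUC near the positive prevalence, which is why the report treats PR-AUC as the more telling metric here.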

Confusion Matrix Analysis

Model                 | TN   | FP   | FN | TP | FP Rate | Miss Rate
Logistic Regression   | 6182 | 1277 | 41 | 33 | 17.12%  | 55.41%
Random Forest (SMOTE) | 7450 | 9    | 68 | 6  | 0.12%   | 91.89%
LightGBM              | 7455 | 4    | 72 | 2  | 0.05%   | 97.30%
Balanced RF           | 6891 | 568  | 42 | 32 | 7.61%   | 56.76%
SGD SVM               | 6646 | 813  | 52 | 22 | 10.90%  | 70.27%
IsolationForest       | 7432 | 27   | 71 | 3  | 0.36%   | 95.95%

Best Models by Metric

• Accuracy: LightGBM (0.9899)
• Balanced Acc: Balanced RF (0.6781)
• Precision: Random Forest (SMOTE) (0.4000)
• Recall: Logistic Regression (0.4459)
• F1: Random Forest (SMOTE) (0.1348)
• ROC-AUC: Balanced RF (0.8459)
• PR-AUC: Random Forest (SMOTE) (0.1147)
• Lowest False Positive Rate: LightGBM (0.05%)
• Lowest Miss Rate: Logistic Regression (55.41%)

XDR – Metrics by Model

XDR – ROC Curves

XDR – Precision–Recall Curves

XDR – Predicted Probability Distributions

XDR – Threshold Sweep

XDR: Logistic Regression – Detailed Analysis

XDR – Logistic Regression: Confusion Matrix

XDR – Logistic Regression: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9934    | 0.8288 | 0.9037 | 7459
1        | 0.0252    | 0.4459 | 0.0477 | 74
accuracy |           |        | 0.8250 | 7533

XDR – Logistic Regression: Feature Importance

XDR: Random Forest (SMOTE) – Detailed Analysis

XDR – Random Forest (SMOTE): Confusion Matrix

XDR – Random Forest (SMOTE): Classification Report

Class    | precision | recall | f1     | support
0        | 0.9910    | 0.9988 | 0.9949 | 7459
1        | 0.4000    | 0.0811 | 0.1348 | 74
accuracy |           |        | 0.9898 | 7533

XDR – Random Forest (SMOTE): Feature Importance

XDR: LightGBM – Detailed Analysis

XDR – LightGBM: Confusion Matrix

XDR – LightGBM: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9904    | 0.9995 | 0.9949 | 7459
1        | 0.3333    | 0.0270 | 0.0500 | 74
accuracy |           |        | 0.9899 | 7533

XDR – LightGBM: Feature Importance

XDR: Balanced RF – Detailed Analysis

XDR – Balanced RF: Confusion Matrix

XDR – Balanced RF: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9939    | 0.9239 | 0.9576 | 7459
1        | 0.0533    | 0.4324 | 0.0950 | 74
accuracy |           |        | 0.9190 | 7533

XDR – Balanced RF: Feature Importance

XDR: SGD SVM – Detailed Analysis

XDR – SGD SVM: Confusion Matrix

XDR – SGD SVM: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9922    | 0.8910 | 0.9389 | 7459
1        | 0.0263    | 0.2973 | 0.0484 | 74
accuracy |           |        | 0.8852 | 7533

XDR – SGD SVM: Feature Importance

XDR: IsolationForest – Detailed Analysis

XDR – IsolationForest: Confusion Matrix

XDR – IsolationForest: Classification Report

Class    | precision | recall | f1     | support
0        | 0.9905    | 0.9964 | 0.9935 | 7459
1        | 0.1000    | 0.0405 | 0.0577 | 74
accuracy |           |        | 0.9870 | 7533

XDR – IsolationForest: Feature Importance

Feature importance not available for this model type.